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Abstract 

This paper presents a multi-swarm PSO algorithm for the Quadratic 
Assignment Problem (QAP) implemented on OpenCL platform. Our 
work was motivated by results of time efficiency tests performed for single¬ 
swarm algorithm implementation that showed clearly that the benehts of 
a parallel execution platform can be fully exploited, if the processed pop¬ 
ulation is large. The described algorithm can be executed in two modes: 
with independent swarms or with migration. We discuss the algorithm 
construction, as well as we report results of tests performed on several 
problem instances from the QAPLIB library. During the experiments the 
algorithm was conhgured to process large populations. This allowed us 
to collect statistical data related to values of goal function reached by 
individual particles. We use them to demonstrate on two test cases that 
although single particles seem to behave chaotically during the optimiza¬ 
tion process, when the whole population is analyzed, the probability that 
a particle will select a near-optimal solution grows. 

Keywords: QAP, PSO, OpenCL, GPU calculation, particle swarm 
optimization, mulit-swarm, discrete optimization 


1 Introduction 

Quadratic Assignment Problem (QAP) |19[ |S] is a well known combinatorial 
problem that can be used as optimization model in many areas diiaiii]. QAP 
may be formulated as follows: given a set of n facilities and n locations, the goal 
is to find an assignment of facilities to unique locations that minimizes the sum 
of flows between facilities multiplied by distances between their locations. As the 
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problem is NP hard m, it can be solved optimally for small problem instances. 
For larger problems (n>30), several heuristic algorithm were proposed |34l[^ IH]. 

One of the discussed methods IMIEI] is the Particle Swarm Optimization 
(PSO). It attempts to find an optimal problem solution by moving a population 
of particles in the search space. Each particle is characterized by two features its 
position and velocity. Depending on a method variation, particles may exchange 
information on their positions and reached values of goal functions [S]. 

In our recent work |33[ we have developed PSO algorithm for the Quadratic 
Assignment Problem on OpenCL platform. The algorithm was capable of pro¬ 
cessing one swarm, in which particles shared information about the global best 
solution to update their search directions. Following typical patterns for GPU 
based calculations, the implementation was a combination of parallel tasks (ker¬ 
nels) executed on GPU orchestrated by sequential operations run on the host 
(GPU). Such organization of computations involves inevitable overhead related 
to data transfer between the host and the GPU device. The time efficiency 
test reported in |33] showed clearly that the benefits of a parallel execution 
platform can be fully exploited, if processed populations are large, e.g. if they 
comprise several hundreds or thousands particles. For smaller populations se¬ 
quential algorithm implementation was superior both as regards the total swarm 
processing time and the time required to process one particle. This suggested 
a natural improvement of the previously developed algorithm: by scaling it up 
to high numbers of particles organized into several swarms. 

In this paper we discuss a multi-swarm implementation PSO algorithm for 
the QAP problem on OpenGL platform. The algorithm can be executed in two 
modes: with independent swarms, each maintaining its best solution, or with 
migration between swarms. We describe the algorithm construction, as well 
as we report tests performed on several problem instances from the QAPLIB 
library [55] ■ Their results show advantages of massive parallel computing: the 
obtained solutions are very close to optimal or best known for particular problem 
instances. 

The developed algorithm is not designed to exploit the problem specificity 
(see for example cni), as well as it is not intended to compete with supercom¬ 
puter or grid based implementations providing exact solutions for the QAP |5]. 
On the contrary, we are targeting low-end GPU devices, which are present in 
most laptops and workstations in everyday use, and accept near-optimal solu¬ 
tions. 

During the tests the algorithm was configured to process large numbers of 
particles (in the order of 10000). This allowed us to collect data related to 
goal function values reached by individual particles and present such statisti¬ 
cal measures as percentile ranks and probability mass functions for the whole 
populations or selected swarms. 

The paper is organized as follows: next Sectionj^discusses the QAP problem, 
as well as the PSO method. It is followed by Section which describes the 
adaptation of PSO to the QAP and the parallel implementation on the OpenGL 
platform. Experiments performed and their results are presented in Section ^ 
Section 1^ provides concluding remarks. 
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2 Related works 

2.1 Quadratic Assignment Problem 

Quadratic Assignment Problem was introduced by Koopmans and Beckman in 
1957 as a mathematical model describing assignment of economic activities to 
a set of locations m- 

Let V = be a set of locations (nodes) linked by arcs. Each 

arc linking a pair of nodes (fc, 1) is attributed with a non-negative weight dki 
interpreted as a distance. Distances are usually presented in form of n x n 
distance matrix D = [dki]- The next problem component is a set of facilities 
N = {1, ...,7T,} and a n x n non-negative flow matrix F = [fij], whose elements 
describe flows between pairs of facilities {i,j). 

The problem goal is to And an assignment tt: N ^ V that minimizes the 
total cost calculated as sum of flows fij between pairs of facilities {i,j) multiplied 
by distances (i 7 r(i) 7 r(j) between pairs of locations (7r(i), 7r(j)), to which they are 
assigned. The permutation tt can be encoded as binary variables Xki, where 
k = 7r(i), what gives the following problem statement: 

n n n n 

min EEEE fijdkl^ki^lj (1) 

z=l j=l k=l 1=1 

subject to: 

E"=i2;*j = 1, forl<j<n 

J2j=i Xij = 1, for 1 < i < n (2) 

Xij G {0,1} 

The n X n matrix X = [xki\ satisfying ([^ is called permutation matrix. 

In most cases matrix D and F are symmetric. Moreover, their diagonal 
elements are often equal 0. Otherwise, the component fudkkXkiXki can be ex¬ 
tracted as a linear part of the goal function interpreted as an installation cost 
of z-th facility at fc-th location . 

QAP models found application in various areas including transportation 
12], scheduling, electronics (wiring problem), distributed computing, statistical 
data analysis (reconstruction of destroyed soundtracks), balancing of turbine 
running |23j , chemistry , genetics |29] , creating the control panels and manufac¬ 
turing la¬ 
in 1976 Sahni and Gonzalez proved that the QAP is strongly NV-hard |30| . 
by showing that a hypothetical existence of a polynomial time algorithm for 
solving the QAP would imply an existence of a polynomial time algorithm for 
an AfV-complete decision problem - the Hamiltonian cycle. 

In many research works QAP is considered one of the most challenging op¬ 
timization problem. This in particular regards problem instances gathered in a 
publicly available and continuously updated QAPLIB library EEm]. A prac¬ 
tical size limit for problems that can be solved with exact algorithms is about 
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n = 30 |16| . In many cases optimal solutions were found with branch and bound 
algorithm requiring high computational power offered by computational grids 
|2] or supercomputing clusters equipped with a few dozen of processor cores and 
hundreds gigabytes of memory m- On the other hand, in m a very successful 
approach exploiting the problem structure was reported. It allowed to solve 
several hard problems from QAPLIB using very little resources. 

A number of heuristic algorithms allowing to find a near-optimal solutions 
for QAP were proposed. They include Genetic Algorithm [T], various versions 
of Tabu search [34], Ant Colonies [321112] and Bees algorithm m- Another 
method, being discussed further, is Particle Swarm Optimization pSl HI] . 

2.2 Particle Swarm Optimization 

The classical PSO algorithm [S] is an optimization method defined for continuous 
domain. During the optimization process a number of particles move through 
a search space and update their state at discrete time steps t = 1, 2,3,... Each 
particle is characterized by position x(t) and velocity v(t). A particle remembers 
its best position reached so far as well as it can use information about 

the best solution found by the swarm (t). 

The state equation for a particle is given by the formula (|^. Coefficients 
ci,C2,C3 S [0,1] are called respectively inertia, cognition (or self recognition) 
and social factors, whereas ri , r2 are random numbers uniformly distributed in 
[ 0 , 1 ] 

v{t -h 1) = Cl • v{t) + C 2 ■ r 2 (t) ■ {p^{t) - x(t)) + C3 • r3(f) • (p^{t) - x(t)) 

x{t + 1) = x{t) + v{f) (3) 

An adaptation of the PSO method to a discrete domain necessities in giving 
interpretation to the velocity concept, as well as defining equivalents of scalar 
multiplication, subtraction and addition for arguments being solutions and ve¬ 
locities. Examples of such interpretations can be found in [7] for the TSP and 
pS] for the QAP. 

A PSO algorithm for solving QAP using similar representations of parti¬ 
cle state was proposed by Liu et al. m- Although the approach presented 
there was inspiring, the paper gives very little information on efficiency of the 
developed algorithm. 

2.3 GPU based calculations 

Recently many computationally demanding applications has been redesigned to 
exploit the capabilities offered by massively parallel computing GPU platforms. 
They include such tasks as: physically based simulations, signal processing, ray 
tracing, geometric computing and data mining m- Several attempts have been 
also made to develop various population based optimization algorithms on GPUs 
including: the particle swarm optimization [36], the ant colony optimization [35], 
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the genetic and memetic algorithm |^. The described implementations 
benefit from capabilities offered by GPUs by processing whole populations by 
fast GPU cores running in parallel. 


3 Algorithm design and implementation 

In this section we describe the algorithm design, in particular the adaptation 
of Particle Swarm Optimization metaheuristic to the QAP problem, as well as 
a specific algorithm implementation on OpenGL platform. As it was stated in 
Section [2T^ the PSO uses generic concepts of position x and velocity v that can 
be mapped to a particular problem in various ways. Designing an algorithm for 
a GPU platform requires decisions on how to divide it into parts that are either 
executed sequentially at the host side or in parallel on the device. 

3.1 PSO adaptation for the QAP problem 

A state of a particle is a pair {X,V). In the presented approach both are 
n X n matrices, where n is the problem size. The permutation matrix X = [xij] 
encodes an assignment of facilities to locations. Its elements Xij are equal to 1, 
if j-th facility is assigned to i-th location, and take value 0 otherwise. 

A particle moves in the solution space following the direction given by the 
velocity V. Elements Vij have the following interpretation: if Vij has high 
positive value, then a procedure determining the next solution should favor an 
assignment Xij = 1. On the other hand, if Vij < 0, then xij = 0 should be 
preferred. 

The state of a particle reached in the t-th iteration will be denoted by 
{X{t),V(t)). In each iteration it is updated according to formulas Q and 

(§. 


V(t + 1) = Sy (ci • U(t) + C2 • r 2 {t) ■ {P^{t) - X{t))+ 


C3-r3{t)-{P^{t)-X{t))) 

(4) 

X{t + l) = Sx{X{t) + Vit)) 

(5) 


Goefficients r 2 and are random numbers from [0,1] generated in each it¬ 
eration for every particle separately. They are introduced to model a random 
choice between movements in the previous direction (according to ci - iner¬ 
tia), the best local solution (self recognition) or the global best solution (social 
behavior). 

All operators appearing in ([^ and ([^ are standard operators from the linear 
algebra. Instead of redefining them for a particular problem, see e.g. [7], we 
propose to use aggregation functions Sy and Sx that allow to adapt the algorithm 
to particular needs of a discrete problem. 
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The function Sy is used to assure that velocities have reasonable values. 
Initially, we thought that unconstrained growth of velocity can be a problem, 
therefore we have implemented a function, which restricts the elements of V to 
an interval [—VmaxTVraax]- This function is referred as raw in TableHowever, 
the experiments conducted showed, that in case of small inertia factor, e.g. Ci = 
0.5, after a few iterations all velocities tend to 0 and in consequence all particles 
converge to the best solution encountered earlier by the swarm. To avoid such 
effect another function that additionally performs column normalization was 
proposed. For each j-th column the sum of absolute values of the elements 
rij = calculated and then the following assignment is made: Vij <— 

Vij luj. 

According to formula (|^ a new particle position X{t -\- 1) is obtained by 
aggregating the previous state components: X{t) and V{t). As elements of a 
matrix X(t) + V(t) may take values from [—Vmax,Vmax + !]> the Sx function 
is responsible for converting it into a valid permutation matrix satisfying 
i.e. having exactly one 1 in each row and column. Actually, is rather a 
procedure, than a function, as it incorporates some elements of random choice. 

Three variants of Sx procedures were implemented: 

1. GlobalMax{X) - iteratively searches for a maximum element in a 
matrix X, sets it to 1 and clears other elements in the row r and c. 

2. PickColumn{X) - picks a column c from A, selects a maximum element 
Xrc, replaces it by 1 and clears other elements in r and c. 

3. SecondTarget{X, Z,d) - similar to GlobalMax{X), however during the 
first d iterations ignores elements Xij, such that Zij = 1. (As the parameter 
Z a solution X from the last iteration is used.) 

In several experiments, in which GlobalMax aggregation procedure was ap¬ 
plied, particles seemed to get stuck, even if their velocities were far from zero 
|33| . We reproduce this effect on a small 3x3 example: 
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If GlobalMax{X{t) + V{t)) is used for the described case, in subsequent 
iterations it will hold: A(t -|- 1) = X{t), until another particle is capable of 
changing {P^{t) — X{t))) component of formula (|^ for velocity calculation. 

A solution for this problem can be to move a particle to a secondary direction 
(target), by ignoring d < n elements that are in the solution X{t) already set 
to 1. This, depending on d, gives an opportunity to reach other solutions, 
hopefully yielding smaller goal function values. If chosen elements are maximal 
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in the remaining matrix, denoted here as X (dd V, they are still reasonable 
movement directions. For the discussed example possible values of X (Zd V 
matrices, where d = 1 and d = 2, are shown below Elements of a new 
solution are marked with circles, whereas the upper index indicates the iteration, 
in which the element was chosen. 


X(Z)iV = 


X02V = 


1 3 

0 4 ®^ 

2 2 

8 1 

®^ 4 6 

2 ( 3)1 2 


X02V 


8 1 
0 @1 6 
@^4 2 


( 6 ) 


It can be observed that for d = 1 the value is exactly the same, as it would 
result from the GlobalMax, however setting d = 2 allows to reach a different 
solution. The pseudocode of SecondTarget procedure is listed in Algorithm 


3.2 Migration 

The intended goal of the migration mechanism is to improve the algorithm 
exploration capabilities by exchanging information between swarms. Actually, 
it is not a true migration, as particles do not move. Instead we modify stored 

[A:] solutions (global best solution for a /c-th swarm) replacing it by randomly 
picked solution from a swarm that performed better (see Algortithm [^. 

The newly set P'^[k] value influences the velocity vector for all particles in 
fc-th swarm according to the formula Q . It may happen that the goal function 
value corresponding to the assigned solution P^[k]) is worse than the previous 
one. It is accepted, as the migration is primarily designed to increase diversity 
within swarms. 

It should be mentioned that a naive approach consisting in copying best 
P^ values between the swarms would be incorrect. (Consider replacing line 5 
of Algorithm with: P'^[sm-k-i] In such case during algorithm 

stagnation spanning over several iterations: in the first iteration the best value 
would be cloned, in the second two copies would be created, in the third 
four and so on. Finally, after k iterations 2^ swarms would follow the same 
direction. In the first group of experiments reported in Section]^ we used up to 
250 swarms. It means that after 8 iterations all swarms would be dominated by 
a single solution. 
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Algorithm 1 Aggregation procedure SecondTarget 
Require: X = Z + V - new solution requiring normalization 
Require: Z - previous solution 

Require: depth - number of iterations, in which during selection of maximum 
element the algorithm ignore positions, where corresponding element of Z 
is equal to 1 

1: procedure SECONDTARGET(A,Z,iiept/i) 

2 ; R i — { 1 ? ■ * ■ 5 

3 : C^{l,...,n} 

4: for i in (l,n) do 

5 : Calculate M, the set of maximum elements 

6: if z < depth then 

7 : Ignore elements Xij such that Zij = 1 

8: M ^ 


9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 


{(r, c): Zrc ^ ^ ^ 

else 

M ^ {(r,c): VtfzRjfzcixrc > x,j)} 

end if 

Randomly select (r, c) from M 

R ^ R \ {r} o Update the sets R and C 

C^C\{c} 

for i in (1, n) do 

Xri ^0 [> Clear r-th row 

Xic ^0 > Clear c-th column 

end for 

Xrc 1 > Assign 1 to the maximum element 

end for 
return X 
end procedure 
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Algorithm 2 Migration procedure 

Require: d - migration depth satisfying d < ml2, where m is the number of 
swarms 

Require: P'^ - table of best solutions for m swarms 

Require: X - set of all solutions 
1: procedure MiGRATiON((i,P®,A) 

2: Sort swarms according to their values into 

a sequence (si, S 2 • ■ ■, Sm- 2 , Sm-i) 

3: for k in (1, d) do 

4: Randomly choose a solution Xkj 

belonging to the swarm Sk 
5: Assign P^[Sm-k-l] ^ Xkj 

6: Update the best goal function value for 

the swarm m — k — 1 
7: end for 

8: end procedure 


3.3 OpenCL algorithm implementation 

OpenCL |18| is a standard providing a common language, programming in¬ 
terfaces and hardware abstraction for heterogeneous platforms including GPU, 
multicore CPU, DSP and FPGA (3T]. It allows to accelerate computations by 
decomposing them into a set of parallel tasks (work items) operating on separate 
data. 

A program on OpenCL platform is decomposed into two parts: sequential 
executed by the CPU host and parallel executed by multicore devices. Functions 
executed on devices are called kernels. They are written in a language being a 
variant of C with some restrictions related to keywords and datatypes. When 
first time loaded, the kernels are automatically translated into the instruction 
set of the target device. The whole process takes about 500ms. 

OpenCL supports ID, 2D or 3D organization of data (arrays, matrices and 
volumes). Each data element is identified by 1 to 3 indices, e.g. for two- 

dimensional arrays. A work item is a scheduled kernel instance, which obtains 
a combination of data indexes within the data range. To give an example, a 2D 
array of data of n x m size should be processed by n • m kernel instances, which 
are assigned with a pair of indexes {i,j), 0 < i < n and 0 < j < m. Those 
indexes are used to identify data items assigned to kernels. 

Additionally, kernels can be organized into workgroups, e.g. corresponding 
to parts of a matrix, and synchronize their operations within a group using 
so called local barrier mechanism. However, workgroups suffer from several 
platform restrictions related to number of work items and amount of accessible 
memory. 

OpenCL uses three types of memory: global (that is exchanged between the 
host and the device), local for a work group and private for a work item. 

In our implementation we used aparapi platform m that allows to write 
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Figure 1: Functional blocks of OpenCL based algorithm implementation. 


OpenCL programs directly in Java language. The platform comprises two parts: 
an API and a runtime capable of converting Java bytecodes into OpenCL work¬ 
loads. Hence, the host part of the program is executed on a Java virtual ma¬ 
chine, and originally written in Java kernels are executed on an OpenCL enabled 
device. 

The basic functional blocks of the algorithm are presented in Fig. Imple¬ 
mented kernels are marked with gray color. The code responsible for generation 
of random particles is executed by the host. We have also decided to leave 
the code for updating best solutions at the host side. Actually, it comprises a 
number of native System. arraycopyO calls. This regards also the migration 
procedure, which sorts swarms indexes in a table according to their P® value 
and copies randomly picked entries. 

Particle data comprise a number of matrices (see Fig. A and 
solutions, - local best particle solution and V - velocity. They are all 
stored in large continuous tables shared by all particles. Appropriate table part 
belonging to a particle can be identified based on the particle id transferred to 
a kernel. Moreover, while updating velocity, the particles reference a table P*^ 
indexed by the swarm id. 

An important decision related to OpenCL program design is related to data 
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Figure 2: Global variables used in the algorithm implementation 


ranges selection. The memory layout in Fig. [^suggests 3D range, whose dimen¬ 
sions are: row, column and particle number. This can be applied for relatively 
simple velocity or goal function calculation. However, the proposed algorithms 
for Sx , see Algorithm are far too complicated to be implemented as a simple 
parallel work item. Finally we decided to use one dimension (particle id) for Sx 
and goal function calculation, and two dimensions (particle id, swarm id) for 
velocity kernels. 

It should be mentioned that requirements of parallel processing limits appli¬ 
cability of object oriented design at the host side. In particular we avoid creating 
particles or swarms with their own memory and then copying small chunks be¬ 
tween the host and the device. Instead we rather use a flyweight design pattern 
|13| : if a particle abstraction is needed, a single object can be configured to 
see parts of large global arrays X, V as its own memory and perform required 
operations, e.g. initialization with random values. 


4 Experiments and results 

In this section we report results of conducted experiments, which aimed at 
establishing the optimization performance of the implemented algorithm, as 
well as to collect data related to its statistical properties. 

4.1 Optimization results 

The algorithm was tested on several problem instances form the QAPLIB |28| . 
whose size ranged between 12 and 150. Their results are gathered in Table [T] 
and Table The selection of algorithm configuration parameters (ci, C 2 and 
C 3 factors, as well as the kernels used) was based on previous results published 
in [23]. In all cases the second target Sx aggregation kernel was applied (see 
Algorithm [^, which in previous experiments occurred the most successful. 

During all tests reported in Table [T] apart the last, the total numbers of 
particles were large: 10000-12500. For the last case only 2500 particles were 
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used due to 1GB memory limit of the GPU device (AMD Radeon HD 6750M 
card). In this case the consumed GPU memory ranged about 950 MB. 

The results show that algorithm is capable of finding solutions with goal 
function values are close to reference numbers listed in QAPLIB. The gap is 
between 0% and 6.4% for the biggest case tail50b. We have repeated tests 
for taidOb problem to compare the implemented multi-swarm algorithm with 
the previous single-swarm version published in m- Gap values for the best 
results obtained with the single swarm algorithm were around 7%-8%. For the 
multi-swarm implementation discussed here the gaps were between 0.64% and 
2.03%. 

The goal of the second group of experiments was to test the algorithm config¬ 
ured to employ large numbers of particles (50000-10000) for well known esc32* 
problem instances from the QAPLIB. Altough they were considered hard, all of 
them have been were recently solved optimally with exact algorithms |25l[IO]. 

The results are summarized in Table We used the following parameters: 
Cl = 0.8, C 2 = 0.5 C 3 = 0.5, velocity kernel: normalized, Sx kernel: second tar¬ 
get. During nearly all experiments optimal values of goal functions were reached 
in one algorithm run. Only the problem esc32a occurred difficult, therefore for 
this case the number of particles, as well as the upper iteration limits were 
increased to reach the optimal solution. What was somehow surprising, in all 
cases solutions differing from those listed in QAPLIB were obtained. Unfortu¬ 
nately, our algorithm was not prepared to collect sets of optimal solutions, so 
we are not able to provide detailed results on their numbers. 

It can be seen that optimal solutions for problem instances esc32c--h were 
found in relatively small numbers of iterations. In particular, for esc32e and 
es32g, which are characterized by small values of goal functions, optimal solu¬ 
tions were found during the initialization or in the first iteration. 

The disadvantage of the presented algorithm is that it uses internally ma¬ 
trix representation for solutions and velocities. In consequence the memory 
consumption is proportional to n^, where n is the problem size. The same re¬ 
gards the time complexity, which for goal function and Sx procedures can be 
estimated as o{n^). This makes optimization of large problems time consuming 
(e.g. even 400 sec for one iteration for tail50b). However, for for medium size 
problem instances, the iteration times are much smaller, in spite of large popu¬ 
lations used. For two runs of the algorithm bur26a reported in Table [T] where 
during each iteration 12500 particles were processed, the average iteration time 
was equal 1.73 sec. For 50000-10000 particles and problems of size n = 32 the 
average iteration time reported in Table was less than 3.7 seconds. 

4.2 Statistical results 

An obvious benefit of massive parallel computations is the capability of pro¬ 
cessing large populations (see Table . Such approach to optimization may 
resemble a little bit a brutal force attack: the solution space is randomly sam¬ 
pled millions of times to hit the best solution. No doubt that such approach 
can be more successful if combined with a correctly designed exploration mech- 
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Table 1: Results of tests for various instance from QAPLIB 
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Table 2: Results of tests for esc32* instances from QAPLIB (problem size 
n = 32). Reached optimal values are marked with asterisks. 


Instance 

Swarms 

Particles 

Total part. 

Goal 

Iter 

Time/iter [ms] 

esc32a 

50 

1000 

100000 

138 

412 

3590.08 

esc32a 

10 

5000 

100000 

134 

909 

3636.76 

esc32a 

50 

2000 

100000 

130* 

2407 

3653.88 

esc32b 

50 

1000 

50000 

168* 

684 

3637.84 

esc32c 

50 

1000 

50000 

642* 

22 

3695.19 

esc32d 

50 

1000 

50000 

400* 

75 

3675.32 

esc32e 

50 

1000 

50000 

2* 

0 

3670.38 

esc32g 

50 

1000 

50000 

6* 

1 

3625.17 

esc32h 

50 

1000 

50000 

438* 

77 

3625.17 


anism that directs the random search process towards solutions providing good 
or near-optimal solutions. In this section we analyze collected statistical data 
related to the algorithm execution to show that the optimization performance of 
the algorithm can be attributed not only to large sizes of processed population, 
but also to the implemented exploration mechanism. 

PSO algorithm can be considered a stochastic process controlled by random 
variables r 2 (t) and r 3 (t) appearing in its state equation Such analysis for 
continuous problems were conducted in jH]. On the other hand, the observable 
algorithm outcomes, i.e. the values of goal functions f{xi{t)) for solutions Xi, 
i = 1,..., n reached in consecutive time moments t € {1, 2,3,... } can be also 
treated as random variables, whose distributions change over time t. Our in¬ 
tuition is that a correctly designed algorithm should result in a nonstationary 
stochastic process {f{xi{t)): t G T}, characterized by growing probability that 
next values of goal functions in the analyzed population are closer to the optimal 
solution. 

To demonstrate such behavior of the implemented algorithm we have col¬ 
lected detailed information on goal function values during two optimization task 
for the problem instance bur26a reported in Table (cases 2 and 3). For both of 
them the algorithm was configured to use 250 swarms comprising 50 particles. 
In the case 2 the migration mechanism was applied and the optimal solution 
was found in the iteration 156, in the case 3 (without migration) a solution very 
close to optimal (gap 0.06%) was reached in the iteration 189. 

Fig-i shows values of goal function for two selected particles during run 
3. The plots show typical QAP specificity. PSO and many other algorithms 
perform a local neighborhood search. For the QAP the neighborhood is char¬ 
acterized by great variations of goal function values. Although mean values of 
goal function decrease in first twenty or thirty iterations, the particles behave 
randomly and nothing indicates that during subsequent iterations smaller values 
of goal functions would be reached more often. 

In Fig. 1^ percentile ranks (75%, 50% 25% and 5%) for two swarms, which 
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Figure 3: Variations of goal function values for two particles exploring the so¬ 
lutions space during the optimization process (bur26a problem instance) 


reached best values in cases 2 and 3 are presented. Although the case 3 is 
characterized by less frequent changes of scores, than the case 2, probably this 
effect can not be attributed to the migration applied. It should be mentioned 
that for a swarm comprising 50 particles, the 0.05 percentile corresponds to just 
two of them. 
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Figure 4: Two runs of bur26a optimization. Percentile ranks for 50 particles 
belonging to the most successful swarms: without migration (above) and with 
migration (below). 

Collected percentile rank values for the whole population comprising 12500 


15 
























































particles are presented in Fig. For both cases the plots are clearly sepa¬ 
rated. It can be also observed that solutions very close to optimal are practi¬ 
cally reached between the iterations 20 (37.3 sec) and 40 (72.4 sec). For the 
whole population the 0.05 percentile represents 625 particles. Starting with the 
iteration 42 their score varies between 5.449048 • 10® and 5.432361 • 10®, i.e. by 
about 0.3%. 


■lO" 



iteration 

10 ® 



iteration 


Figure 5: Two runs of bur26a optimization. Percentile ranks for all 12500 
particles: without migration (above) and with migration (below). 

Fig. i shows, how the probability distribution (probability mass function 
PMF) changed during the optimization process. In both cases the the opti¬ 
mization process starts with a normal distribution with the mean value about 
594500. In the subsequent iterations the maximum of PMF grows and moves 
towards smaller values of the goal function. There is no fundamental difference 
between the two cases, however for the case 3 (with migration) maximal values 
of PMF are higher. It can be also observed that in the iteration 30 (completed 
in 56 seconds) the probability of hitting a good solution is quite high, more then 
10 %. 

Interpretation of PMF for the two most successful swarms that reached best 
values in the discussed cases is not that obvious. For the case without migration 
(Fig. above) there is a clear separation between the initial distribution and 
the distribution reached in the iteration, which yielded the best result. In the 
second case (with migration) a number of particles were concentrated around 
local minima. 

The presented data shows advantages of optimization performed on mas- 
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Figure 6: Probability mass functions for 12050 particles organized into 250 x 
50 swarms during two runs: without migration (above) and with migration 
(below). 


sive parallel processing platforms. Due to high number of solutions analyzed 
simultaneously, the algorithm that does not exploit the problem structure can 
yield acceptable results in relatively small number of iterations (and time). For 
a low-end GPU devices, which was used during the test, good enough results 
were obtained after 56 seconds. It should be mentioned that for both presented 
cases the maximum number of iterations was set to 200. With 12500 particles, 
the ratio of potentially explored solutions to the whole solution space was equal 
200-12500/26! = 6.2 • lO-^P 

5 Conclusions 

In this paper we describe a multi-swarm PSO algorithm for solving the QAP 
problem designed for the OpenCL platform. The algorithm is capable of pro¬ 
cessing in parallel large number of particles organized into several swarms that 
either run independently or communicate with use of the migration mechanism. 
Several solutions related to particle state representation and particle movement 
were inspired by the work of Liu at al. m, however, they were refined here to 
provide better performance. 

We tested the algorithm on several problem instances from the QAPLIB li¬ 
brary obtaining good results (small gaps between reached solutions and reference 
values). However, it seems that for problem instances of large sizes the selected 
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Figure 7: Probability mass functions for 50 particles belonging to the most suc¬ 
cessful swarms during two runs: without migration (above) and with migration 
(below). One point represents an upper bound for 5 particles. 


representation of solutions in form of permutation matrices hinders potential 
benefits of parallel processing. 

During the experiments the algorithm was configured to process large pop¬ 
ulations. This allowed us to collect statistical data related to goal function 
values reached by individual particles. We used them to demonstrate on two 
cases that although single particles seem to behave chaotically during the opti¬ 
mization process, when the whole population is analyzed, the probability that 
a particle will select a near-optimal solution grows. This growth is significant 
for a number of initial iterations, then its speed diminishes and finally reaches 
zero. 

Statistical analysis of experimental data collected during optimization pro¬ 
cess may help to tune the algorithm parameters, as well as to establish realistic 
limits related to expected improvement of goal functions. This in particular 
regards practical applications of optimization techniques, in which recurring 
optimization problems appear, i.e. the problems with similar size, complexity 
and structure. Such problems can be near-optimally solved in bounded time on 
massive parallel computation platforms even, if low-end devices are used. 
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